Overview

Dataset Statistics

Number of Variables 4
Number of Rows 80230
Missing Cells 0
Missing Cells (%) 0.0%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 32.6 MB
Average Row Size in Memory 425.5 B
Variable Types
  • Numerical: 1
  • Categorical: 3

Dataset Insights

  • idx is skewed
  • docstring_tokens has a high cardinality: 56780 distinct values
  • code_tokens has a high cardinality: 65851 distinct values
  • url has a high cardinality: 80230 distinct values
  • url has all distinct values
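
The cardinality insights above can be related to the row count as a distinct-value ratio. The following is a minimal sketch that uses only counts reported in this section; the variable names are illustrative and are not part of the profiling tool.

```python
# Counts copied from the report; the names below are illustrative only.
n_rows = 80230
distinct_counts = {
    "docstring_tokens": 56780,
    "code_tokens": 65851,
    "url": 80230,
}

# Share of rows holding a distinct value, as a percentage with one decimal.
unique_pct = {
    name: round(100 * count / n_rows, 1)
    for name, count in distinct_counts.items()
}

# A column has all distinct values when its distinct count equals the row count.
is_unique = {name: count == n_rows for name, count in distinct_counts.items()}

for name in distinct_counts:
    print(f"{name}: {unique_pct[name]}% distinct, all distinct: {is_unique[name]}")
```

Only url reaches 100.0%, which is why it alone carries the "all distinct values" insight.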

Variables


idx

numerical

Approximate Distinct Count 79560
Approximate Unique (%) 99.2%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 1283680
Mean 202457.1517
Minimum 11
Maximum 1.405e+06
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • idx is skewed right (γ1 = 2.1703)

Quantile Statistics

Minimum 11
5-th Percentile 13123.45
Q1 48332.25
Median 87832.5
Q3 134752.75
95-th Percentile 959460
Maximum 1.405e+06
Range 1.405e+06
IQR 86420.5
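
The IQR and Range rows follow directly from the other quantile rows. A small sketch of that arithmetic, using the values reported above (the maximum is only available in rounded scientific notation, so the range is approximate):

```python
# Quantile values copied from the report.
q1 = 48332.25
q3 = 134752.75
minimum = 11
maximum = 1.405e6  # rounded in the report, so the range below is approximate

# Interquartile range: spread of the middle 50% of the values.
iqr = q3 - q1

# Range: distance between the smallest and largest value.
value_range = maximum - minimum

print(f"IQR = {iqr}")
print(f"Range = {value_range:.4g}")
```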

Descriptive Statistics

Mean 202457.1517
Standard Deviation 287574.8404
Variance 8.2699e+10
Sum 1.6243e+10
Skewness 2.1703
Kurtosis 3.929
Coefficient of Variation 1.4204
  • idx is not normally distributed (p-value ≈ 2.757e-11)
  • idx has 17289 outliers
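
The descriptive statistics above are linked by simple identities: the variance is the squared standard deviation, and the coefficient of variation is the standard deviation divided by the mean. The sketch below checks those identities against the reported values. It also computes the common 1.5 × IQR outlier fences; the report does not say which outlier rule it applies, so that part is an assumption and is not meant to reproduce the outlier count.

```python
# Values copied from the report.
mean = 202457.1517
std = 287574.8404
q1 = 48332.25
q3 = 134752.75

# Variance is the square of the standard deviation.
variance = std ** 2

# Coefficient of variation: dispersion relative to the mean.
cv = std / mean

# Assumption: Tukey-style fences at 1.5 * IQR beyond the quartiles.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"variance = {variance:.4e}")
print(f"cv = {cv:.4f}")
print(f"fences = ({lower_fence}, {upper_fence})")
```

Under that assumption the lower fence is negative, so only values above the upper fence could be flagged, which is consistent with the right skew reported for idx.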

docstring_tokens

categorical

Approximate Distinct Count 56780
Approximate Unique (%) 70.8%
Missing 0
Missing (%) 0.0%
Memory Size 10573857

Length

Mean 65.894
Standard Deviation 50.4707
Median 54
Minimum 2
Maximum 1108

Sample

1st row ['Python3', 'imple...
2nd row ['Stores', 'the', ...
3rd row ['Traverse', 'the'...
4th row ['Stores', 'the', ...
5th row ['Traverse', 'the'...

Letter

Count 2625301
Lowercase Letter 2519157
Space Separator 575120
Uppercase Letter 106144
Dash Punctuation 5712
Decimal Number 23133
  • docstring_tokens contains many words: 7680 words
  • The most frequent word (function) occurs over 1.53 times as often as the second most frequent word (driver)

code_tokens

categorical

Approximate Distinct Count 65851
Approximate Unique (%) 82.1%
Missing 0
Missing (%) 0.0%
Memory Size 18590390
  • The most frequent value (['if', '__name__', '==', "' _ _ main _ _ '", ':', 'NEW_LINE']) occurs over 1.53 times as often as the second most frequent value (['import', 'math', 'NEW_LINE'])

Length

Mean 151.7518
Standard Deviation 144.0961
Median 105
Minimum 0
Maximum 3233

Sample

1st row ['def', 'maxPresum...
2nd row ['X', '=', 'max', ...
3rd row ['for', 'i', 'in',...
4th row ['Y', '=', 'max', ...
5th row ['for', 'i', 'in',...

Letter

Count 3903576
Lowercase Letter 1918212
Space Separator 1732826
Uppercase Letter 1985364
Dash Punctuation 24719
Decimal Number 177121
  • code_tokens contains many words: 15361 words
  • The most frequent word (new_line) occurs over 3.82 times as often as the second most frequent word (1)

url

categorical

Approximate Distinct Count 80230
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Memory Size 6292016

Length

Mean 13.4247
Standard Deviation 0.6503
Median 13
Minimum 10
Maximum 15

Sample

1st row 10005-Python-1
2nd row 10005-Python-2
3rd row 10005-Python-3
4th row 10005-Python-4
5th row 10005-Python-5

Letter

Count 481380
Lowercase Letter 401150
Space Separator 0
Uppercase Letter 80230
Dash Punctuation 160460
Decimal Number 435226
  • url contains many words: 80230 words
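
The Letter tables count characters by class, and the row labels match Unicode general category names. A minimal sketch of such a count, assuming the labels map to the category codes Ll (Lowercase Letter), Lu (Uppercase Letter), Zs (Space Separator), Pd (Dash Punctuation) and Nd (Decimal Number); `category_counts` is a hypothetical helper, not the profiling tool's own function.

```python
import unicodedata
from collections import Counter


def category_counts(text):
    """Count the characters of `text` by Unicode general category code."""
    return Counter(unicodedata.category(ch) for ch in text)


sample = "10005-Python-1"  # first sample row of the url column
counts = category_counts(sample)

# The "Count" row of the Letter table equals lowercase plus uppercase letters.
letters = counts["Ll"] + counts["Lu"]

print(dict(counts))
print(f"letters = {letters}")
```

For this sample the sketch finds 5 lowercase letters, 1 uppercase letter and 2 dashes. Multiplied by the 80230 rows, that gives 401150, 80230 and 160460, which matches the url table if every value follows the same `<digits>-Python-<digits>` pattern.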